In this part of the course, we will cover the following concepts:
| Topic | Complete |
|---|---|
| Summarize columns using the summarise and group by functions | |
| Understand tidy data and its advantages |
With a bit of dplyr experience, we can do a substantial amount of data wrangling
We can use filter to subset data, arrange to order observations, select to choose specific columns, and mutate to create new variables
That leaves two functions left to unpack, among the most commonly used in dplyr:
summarise(), to create a new summary data framegroup_by(), to take an existing table and convert it to a grouped table so operations can be performed by groupsummarise() collapses a data frame down to a single rowsummarise is not very helpfulgroup_by()summarize() (spelled with a “z”)ungroup()nycflights13 dataset# A tibble: 1 x 1
delay
<dbl>
1 12.6
# Create `by_day` by grouping `flights` by year, month, and day.
by_day = group_by(flights, year, month, day)
by_day# A tibble: 336,776 x 19
# Groups: year, month, day [365]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
summarise() and group_by() have some usefulnesssummarise() and its arguments on grouped data, you can generate new insights# Now use grouped `by_day` data and summarise it to see the average delay by year, month and day.
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))# A tibble: 365 x 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# … with 355 more rows
%>%)%>% as delays = flights %>% #<- take flights data
group_by(dest) %>% #<- group it by destination
summarise(count = n(), #<- then summarize by creating count variable
dist = mean(distance, na.rm = TRUE), #<- and computing mean distance
delay = mean(arr_delay, na.rm = TRUE)) %>% #<- and mean arrival delay
filter(count > 20, dest != "HNL") #<- then filter it`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 96 x 4
dest count dist delay
<chr> <int> <dbl> <dbl>
1 ABQ 254 1826 4.38
2 ACK 265 199 4.85
3 ALB 439 143 14.4
4 ATL 17215 757. 11.3
5 AUS 2439 1514. 6.02
6 AVL 275 584. 8.00
7 BDL 443 116 7.05
8 BGR 375 378 8.03
9 BHM 297 866. 16.9
10 BNA 6333 758. 11.8
# … with 86 more rows
NAs# A tibble: 365 x 4
# Groups: year, month [12]
year month day mean
<int> <int> <int> <dbl>
1 2013 1 1 NA
2 2013 1 2 NA
3 2013 1 3 NA
4 2013 1 4 NA
5 2013 1 5 NA
6 2013 1 6 NA
7 2013 1 7 NA
8 2013 1 8 NA
9 2013 1 9 NA
10 2013 1 10 NA
# … with 355 more rows
NAs, the aggregation functions will return NAs for each item if there is just one NA in the inputNAs# A tibble: 365 x 4
# Groups: year, month [12]
year month day mean
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# … with 355 more rows
mean(), there are many other summary functions describing data from various aspects:| Summary Functions | Explanation |
|---|---|
n()
|
will count the number of entries that come from a summarise |
min(x), quantile(x, 0.25), max(x)
|
measures of rank and distribution can be used |
first(x), nth(x, 2), last(x)
|
measures of position and order |
n_distinct
|
will count the number of distinct values |
n() will count the number of entries that come from a summarise() functionn() can summarise(), mutate() and filter()flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE),
n = n()) #<- add a column with summary counts# A tibble: 365 x 5
# Groups: year, month [12]
year month day mean n
<int> <int> <int> <dbl> <int>
1 2013 1 1 11.5 842
2 2013 1 2 13.9 943
3 2013 1 3 11.0 914
4 2013 1 4 8.95 915
5 2013 1 5 5.73 720
6 2013 1 6 7.15 832
7 2013 1 7 5.42 933
8 2013 1 8 2.55 899
9 2013 1 9 2.28 902
10 2013 1 10 2.84 932
# … with 355 more rows
count() is a simple count function that count() is the rough equivalent of summarise(n = n())# A tibble: 31 x 2
day n
<int> <int>
1 1 11036
2 2 10808
3 3 11211
4 4 11059
5 5 10858
6 6 11059
7 7 10985
8 8 11271
9 9 10857
10 10 11227
# … with 21 more rows
summarise() to return the min(x), quantile(x, 0.25), and max(x) from the grouped data to display the departure times of the first and last flights# 1. Build a subset of all flights that were not canceled.
not_cancelled = flights %>%
filter(!is.na(dep_time)) #<- filter flights where `dep_time` was not `NA`
# 2. Group and summarize all flights that were not canceled to get desired results.
not_cancelled %>%
group_by(year, month, day) %>% #<- group the not canceled flights
summarise(first = min(dep_time), #<- then summarize them by calculating the first
last = max(dep_time)) #<- and last flights in the `dep_time` in each group# A tibble: 365 x 5
# Groups: year, month [12]
year month day first last
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# … with 355 more rows
first() and last() functions# 1. Group and summarize all flights that were not canceled to get desired results using first and last functions.
not_cancelled %>%
group_by(year, month, day) %>% #<- group the not canceled flights
summarise(first = first(dep_time), #<- then summarize them by calculating the first
last = last(dep_time)) #<- and last flights in the `dep_time` in each group# A tibble: 365 x 5
# Groups: year, month [12]
year month day first last
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# … with 355 more rows
n_distinct(x) will count the number of distinct values# Number of flights that take off, by day.
not_cancelled %>%
group_by(year, month, day) %>%
summarise(flights_that_take_off = n_distinct(dep_time)) #<- calculate distinct departure times# A tibble: 365 x 4
# Groups: year, month [12]
year month day flights_that_take_off
<int> <int> <int> <int>
1 2013 1 1 552
2 2013 1 2 583
3 2013 1 3 589
4 2013 1 4 589
5 2013 1 5 495
6 2013 1 6 564
7 2013 1 7 572
8 2013 1 8 573
9 2013 1 9 580
10 2013 1 10 572
# … with 355 more rows
group_by() adds metadata to the dataframe that impacts its usability downstreamungroup() after every group_by()# Take the same `not_canceled` data, but now group by month instead of by day.
not_cancelled %>% #<- set data frame
ungroup() %>% #<- first ungroup it
group_by(year, month) %>% #<- then group by year and month
summarise(flights_by_year = n_distinct(dep_time))#<- then do the rest ...# A tibble: 12 x 3
# Groups: year [1]
year month flights_by_year
<int> <int> <int>
1 2013 1 1165
2 2013 2 1171
3 2013 3 1199
4 2013 4 1216
5 2013 5 1186
6 2013 6 1220
7 2013 7 1242
8 2013 8 1204
9 2013 9 1156
10 2013 10 1139
11 2013 11 1135
12 2013 12 1191
| Topic | Complete |
|---|---|
| Summarize columns using the summarise and group by functions |
✔ |
| Understand tidy data and its advantages |
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
These three interrelated rules make a dataset tidy:
tidy_country, the example below, follows all three rules.
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Storing data in a consistent way:
Making use of R’s internal vectorization:
Making use of pivot_longer and pivot_wider:
tidyr that help transform messy data to tidy data| Topic | Complete |
|---|---|
| Summarize columns using the summarise and group by functions |
✔ |
| Understand tidy data and its advantages |
✔ |
You are now ready to try Tasks 1-3 in the Exercise for this topic